Abstract

Discrete Event Simulation (DES) is a widely used technique in 
which the state of the simulator is updated by events happening 
at discrete points in time (hence the name). DES is used to model 
and analyze many kinds of systems, including computer architec- 
tures, communication networks, street traffic, and others. Parallel 
and Distributed Simulation (PADS) aims at improving the effi- 
ciency of DES by partitioning the simulation model across multiple 
processing elements, in order to enable larger and/or more detailed 
studies to be carried out. Interest in PADS is increasing thanks to
the widespread availability of multicore processors and affordable
high performance computing clusters. However, designing parallel
simulation models requires considerable expertise, the result being 
that PADS techniques are not as widespread as they could be. In 
this paper we describe ErlangTW, a parallel simulation middleware 
based on the Time Warp synchronization protocol. ErlangTW is 
entirely written in Erlang, a concurrent, functional programming 
language specifically targeted at building distributed systems. We 
argue that writing parallel simulation models in Erlang is con- 
siderably easier than using conventional programming languages. 
Moreover, ErlangTW allows simulation models to be executed on
single-core, multicore and distributed computing architectures.
We describe the design and prototype implementation of ErlangTW,
and report some preliminary performance results on multicore and
distributed architectures using the well known PHOLD benchmark.

Categories and Subject Descriptors D.1.3 [Software]: Programming
Techniques - Concurrent Programming; I.6.8 [Computing
Methodologies]: Simulation and Modeling - Types of Simulation

General Terms Languages, Performance 

Keywords Parallel and Distributed Simulation, PADS, Time 
Warp, Erlang 

1. Introduction 

Simulation is a widely used modeling technique, which is applied 
to study phenomena for which a closed form analytical solution 
is either not known, or too difficult to obtain. There are many
types of simulation: in a continuous simulation the system state 
changes continuously with time (e.g., simulating the temperature 
distribution over time inside a datacenter); in a discrete simulation 
the system state changes only at discrete points in time; finally, in 
a Monte Carlo simulation there is no explicit notion of time, as it 
relies on repeated random sampling to compute some result. 

Discrete Event Simulation (DES) is of particular interest, since 
it has been successfully applied to modeling and analysis of many 
types of systems, including computer system architectures, communication
networks, street traffic, and others. In a Discrete Event
Simulation, the system is described as a set of interacting entities; 
the state of the simulator is updated by simulation events, which 
happen at discrete points in time. For example, in a computer net- 
work simulation the following events may be defined: (1) arrival of 
a new packet at a router; (2) the router starts to process a packet; 
(3) the router finishes processing a packet; (4) packet transmission 
starts; (5) a timeout occurs and a packet is dropped; and so on. 

The overall structure of a sequential event-based simulator is 
relatively simple: the simulator engine maintains a list, called the
Future Event List (FEL), of all pending events, sorted in nondecreasing
simulation time of occurrence. The simulator executes the main
simulation loop; at each iteration, the event with the lowest timestamp t
is removed from the FEL, and the simulation time is advanced to t.
Then, the event is executed, which triggers any combination of the 
following actions: 

• The state of the simulation is updated; 

• Some events may be scheduled at some future time; 

• Some scheduled events may be removed from the FEL; 

• Some scheduled events may be rescheduled for a different time. 

The simulation stops when either the FEL is empty, or some
user-defined stopping criteria are met (e.g., some predefined
maximum simulation time is reached, or enough samples of events of
interest have been collected). The FEL is usually implemented as a
priority queue, although different data structures have been
considered and provide varying degrees of efficiency [18].
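
The main loop described above can be sketched in a few lines of Erlang. This is only an illustrative sketch, not code from ErlangTW: the module name des_sketch, the Handler callback and its return shape are assumptions made for the example. The FEL is a gb_trees tree keyed by {Time, Seq}, where Seq is a counter added here to keep keys unique.

```erlang
-module(des_sketch).
-export([run/3, demo/0]).

%% Handler is a user-supplied fun (hypothetical API for this sketch):
%%   Handler(Time, Event, State) -> {NewState, [{FutureTime, FutureEvent}]}
run(Fel, State, Handler) ->
    loop(Fel, State, Handler, 0).

loop(Fel, State, Handler, Seq) ->
    case gb_trees:is_empty(Fel) of
        true ->
            State;                        % stop: the FEL is empty
        false ->
            %% Remove the event with the lowest timestamp T; the
            %% simulation time is implicitly advanced to T.
            {{T, _}, Event, Fel1} = gb_trees:take_smallest(Fel),
            {State1, NewEvents} = Handler(T, Event, State),
            %% Schedule the events generated by the handler.
            {Fel2, Seq1} = lists:foldl(
                fun({T2, E2}, {F, S}) ->
                    {gb_trees:insert({T2, S}, E2, F), S + 1}
                end, {Fel1, Seq}, NewEvents),
            loop(Fel2, State1, Handler, Seq1)
    end.

%% Each tick records its time and schedules the next tick 5 time
%% units later, until simulated time 20 is reached.
demo() ->
    Handler = fun(T, tick, Acc) when T < 20 -> {[T | Acc], [{T + 5, tick}]};
                 (T, tick, Acc)             -> {[T | Acc], []}
              end,
    Fel0 = gb_trees:insert({0, -1}, tick, gb_trees:empty()),
    lists:reverse(run(Fel0, [], Handler)).
```

Calling des_sketch:demo() processes the ticks in timestamp order and returns the list of times at which they were executed. Removing or rescheduling pending events is omitted to keep the sketch short.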

Traditional sequential DES techniques may become inappropriate
for analyzing large and/or detailed models, due to the large
number of events, which can require considerable (wall clock) time
to complete a simulation run. The Parallel and Distributed
Simulation (PADS) discipline aims at taking advantage of modern
high performance computing architectures, from massively parallel
computers to multicore processors, to handle large models
efficiently [14]. The general idea of PADS is to partition the simulation
model into submodels, called Logical Processes (LPs), which can
be evaluated concurrently by different Processing Elements (PEs).
More precisely, the simulation model is described in terms of multiple
interacting entities which are assigned to different LPs. Each LP,
which is executed on a different PE, is in practice the container of a
set of entities. The simulation evolution is obtained through the
exchange of timestamped messages (representing simulation events)
between the entities. In order to ensure that causal dependencies
between events are not violated [19], each receiving entity must
process incoming events in nondecreasing timestamp order.

We observe that multi- and many-core processor architectures
are now ubiquitous; moreover, the Cloud computing paradigm
allows users to rent high performance computing clusters using
a "pay as you go" pricing model. Since high performance
computing resources are readily available, one would expect
PADS techniques, which have been refined to take advantage
of precisely that kind of resources, to be widespread. Unfortunately,
PADS techniques have not gained much popularity outside highly
specialized user communities.

There are many reasons for that [10], but we believe that the
fundamental issue with PADS is that parallel simulation models are
currently not transparent to the user. Figure 1(a) shows the (greatly
simplified) structure of a DES stack. At the higher level we have
the user-defined simulation model; the model defines the events
and how they change the system state. In practice, the model is
implemented using either a general-purpose programming language,
or a language specifically tailored for writing simulations (e.g.,
Simula [9], GPSS, Dynamo, Parsec, SIMSCRIPT III).
The simulation program depends on some underlying simulation
engine, which provides core facilities such as random number
generation, FEL handling, statistics collection and so on. The
simulation engine may be implemented as a software library to be linked
against the user-defined model. Finally, at the lower level, the
simulation is executed on some hardware platform, which in general
is a general-purpose processor; ad-hoc architectures have also been
considered (e.g., the ANTON supercomputer [31]).

The current state of PADS is similar to Figure 1(b). Different
parallel/distributed simulation libraries and middlewares
have been proposed (e.g., µsik [24], SPEEDES, PRIME,
GAIA/ARTIS), each one specifically tailored for a particular
environment or hardware architecture. While hardware dependency
is unavoidable (shared memory parallel algorithms are quite different
from distributed memory ones, for example), the problem here
is that low level details are exposed to the user, who therefore
must implement the simulation model taking explicitly into account
where the model will be executed. This seriously limits the
possibility of porting the same model to different platforms.

ErlangTW is a step towards the more desirable situation shown
in Figure 1(c). ErlangTW is a simulation library written in
Erlang [3], which implements the Time Warp synchronization protocol
for parallel and distributed simulations [17]. Erlang is a
concurrent programming language based on the functional paradigm
and the actor model, where concurrent objects interact using
share-nothing message passing. In this way, the same application can
potentially run unchanged on single-core processors, shared memory
multiprocessors and distributed memory clusters. The Erlang
Virtual Machine can automatically make use of all the available
cores on a multicore processor, providing a uniform communication
abstraction on shared memory machines. Multiple Erlang
VMs can provide a similar abstraction on distributed memory
systems. Thanks to these features, the same ErlangTW simulation
model can be executed serially on single-core processors, or
concurrently on multicores or clusters. Of course, performance will
depend both on the model and on the underlying architecture;
however, preliminary experiments with the PHOLD benchmark
(reported in Section 5) show that scalability across different processor
architectures can indeed be achieved. Moreover, future versions of
ErlangTW will add support for the adaptive runtime migration of
simulated entities (or whole LPs) using the serialization features
offered by Erlang. Due to many technical difficulties, this approach
is not common in PADS tools, but it often speeds up the
simulation execution.

This paper is structured as follows. In Section 2 we review the
scientific literature and contrast our approach to similar works. In
Section 3 we introduce the basic concepts of distributed simulation
and the Time Warp protocol. In Section 4 we present the architecture
and implementation of ErlangTW. We evaluate the performance
of ErlangTW using the PHOLD benchmark, both on a multicore
processor and on a small distributed memory cluster; performance
results are described in Section 5. Finally, conclusions and
future works are presented in Section 6.

2. Related Works 

Over the years, many PADS tools, languages and middlewares have 
been proposed (a comprehensive but somewhat outdated list can 
be found in [20]); in this section we highlight some of the most
significant results with specific attention to the implementations of 
the Time Warp synchronization mechanism. 

µsik [24] is a multi-platform micro-kernel for the implementation
of parallel and distributed simulations. The micro-kernel provides
advanced features such as support for reverse computation
and some kind of load balancing.

The Synchronous Parallel Environment for Emulation and
Discrete-Event Simulation (SPEEDES) and the WarpIV Kernel [33]
have been used as testbeds for investigating new approaches
to parallel simulation. SPEEDES is a software framework
for building parallel simulations in C++. SPEEDES provides
support for optimistic simulations by defining new data types
for variables which can be rolled back to a previous state (as we
will see in Section 3, this is required for optimistic simulations).
SPEEDES uses the Qheap data structure for event management,
which provides better performance with respect to conventional
priority queue data structures. SPEEDES has also been used for
many seminal works on load-balancing in optimistic synchronization.

DSIM is a Time Warp simulator which targets clusters comprised
of thousands of processors and that implements some advanced
techniques for memory management (e.g., Time Quantum
GVT and Local Fossil Collection).

We are aware of two existing simulation engines based on the
Erlang programming language: Sim94 [6] and Sim-Diasca.
Sim94 was originally developed for military leadership training
of battalion commanders, and is based on a client-server
paradigm. The server runs the simulation model, while clients 
can connect at any time to inspect or change the simulation state. 
It should be observed that Sim94 implements a conventional se- 
quential simulator, while ErlangTW implements a parallel and dis- 
tributed simulator based on the Time Warp synchronization proto- 
col. Sim-Diasca, on the other hand, is a true PADS engine (simu- 
lation models can be executed on multiple execution units), but is 
based on a time-stepped synchronization approach. A time-stepped
simulation is divided into fixed-length time steps; all execution
units execute each step concurrently and synchronize before
executing the next one (see Section 3). Time-stepped simulations can
be appropriate for systems whose evolution is "naturally" driven 
by a sequence of steps (e.g., circuit simulation evolving according 
to a global clock). Issues in time-stepped simulations include the 
need to find the appropriate duration of steps, and the high cost of 
synchronization. 

A recent work [12] investigated the use of the Go programming
language (http://golang.org/) to implement an optimistic parallel
simulator for multicore processors. The simulator, called Go-Warp,
is based on the Time Warp mechanism. Go provides mechanisms for
concurrent execution and inter-process communication, which facilitate
the development of parallel applications. As in Erlang, all these
mechanisms are part of the language core and are not provided as
external libraries. However, Go-Warp cannot be executed on a
distributed memory cluster without a major redesign; in this respect,
ErlangTW represents a significant improvement, since the simulator
runs without any modification on both shared memory and
distributed memory architectures. To the best of our knowledge,
Erlang has not previously been used to implement a Time Warp
simulation engine.

Figure 1. Layered structure of discrete-event simulators: (a) a
simulation model on top of a simulation engine and the hardware;
(b) platform-specific models (X, Y, Z), each tied to its own engine
(X, Y, Z) and hardware platform (single-core CPU, multicore CPU,
HPC cluster); (c) a single simulation model on top of the ErlangTW
engine and the Erlang VM, running on a single-core CPU, a multicore
CPU, or an HPC cluster.

3. Distributed Simulation 

A Parallel and Distributed Simulation (PADS) can be defined as "a
simulation in which more than one processor is employed" [25]. As
already observed in the introduction, there are many reasons for
relying on PADS: to obtain results faster, to simulate larger scenarios,
to integrate simulators that are geographically distributed, to
integrate a set of commercial off-the-shelf simulators and to compose
different simulation models in a single simulator [15].

The main difference between sequential simulation and PADS 
is that in the latter there is no global shared system state. A PADS 
is realized as a set of entities; an entity is the smallest component 
of the simulation model, and therefore defines the model's granu- 
larity. Entities interact with each other by exchanging timestamped 
events. Entities are executed inside containers called LPs. Each LP 
dispatches the events to the contained entities, and interacts with 
the other LPs for synchronization and data distribution. In practice, 
each LP is usually executed by a PE (e.g., a single core in modern 
multicore processors). Each LP notifies relevant events to other LPs 
by sending messages using whatever communication medium is 
available to the PEs. Each message is a pair (t, e), where e is a 
descriptor of the event to be processed, and t is the simulation time 
at which e must be processed. Of course, the message header in- 
cludes additional information, such as the ID of the originator and 
destination entities. 

The situation is illustrated in Figure 2. Each LP contains a set
of entities, and a queue of events which are to be executed by the
local entities. The event queue plays the same role as the FEL
of sequential simulations: the LP fetches the event with the lowest
timestamp and forwards it to the destination entity. If an entity
creates an event for a remote entity, the LP uses an underlying
communication network to send the event to the corresponding
remote LP.

Figure 3. An example of causality violation

The term "parallel simulation" is used if the PEs have access to a
common shared memory, or in the presence of a tightly coupled
interconnection network. Conversely, "distributed simulation" is used in
the case of loosely coupled architectures (i.e., distributed memory
clusters) [25]. In practice, modern high-performance systems are often
hybrid architectures where a large number of shared memory
multiprocessors are connected with a low latency network. Therefore,
the term PADS is used to denote both approaches.

It is important to observe that, even if a shared system state
is indeed available on shared memory multiprocessors, the state is
still partitioned across the PEs in order to avoid race conditions and
improve performance.

Model partitioning Partitioning the model is nontrivial, and in 
general the optimal partition strategy may depend on the structure 
and semantics of the system to be simulated. For example, in a
wireless sensor network simulation where each sensor node can 
interact only with neighbors, it is reasonable to partition the model 
according to geographic proximity of sensors. Many conflicting 
issues must be taken into account when partitioning a simulation 
model into LPs. Ideally, the partition should minimize the amount 
of communication between PEs; however, the partition should also 
try to balance the workload across different PEs, in order to avoid 
bottlenecks on overloaded PEs. Finally, it is necessary to consider 
that a fixed partitioning scheme may not be appropriate, e.g., when 
the interactions among LPs change over time. In this scenario, some 
form of adaptive partitioning should be employed, but this feature
is not provided by most currently available simulators.

Synchronization The results of a PADS are correct if the outcome
is identical to the one produced by a sequential execution, in which
all events are processed in nondecreasing timestamp order (we
assume that we can always break ties, so that no two events
occur at the exact same simulation time). In PADS, each LP i keeps
a local variable LVTi called Local Virtual Time (LVT), which
represents the (local) simulation time. LP i can process message
(t, e) if t > LVTi; after executing the event e, the LVT is set to t.

It should be observed that the LVT of each LP advances at a
different rate, due to load imbalance or communication delays. This
may cause problems, such as the one shown in Figure 3. We depict
the timelines associated to three LPs, LP1, LP2 and LP3. The
numbers on each timeline represent the LVT of each LP. Arrows
represent events; for simplicity, all messages are timestamped with
the sender's LVT.

When LP2 receives (7, e2) from LP1, it sets LVT2 = 7 and
executes the event e2. Then, LP2 advances its LVT to 10, and sends
a new message (10, e3) to LP1. After that, message (8, e4) arrives
from LP3; e4 cannot be executed, since LVT2 has already been
advanced to 10. Moreover, LP2 sent out a message (10, e3) for
event e3, which may or may not have been generated had e4
been executed in the correct order, before e3.

Figure 3 shows an example of causality violation [19]. Two
events are said to be in causal order if one of them can have some 
consequences on the other. In PADS, different synchronization 
strategies have been developed to guarantee causal ordering of 
events: time-stepped, conservative and optimistic. 

In a time-stepped simulation, the time is divided in fixed-size 
steps, and each LP can proceed to the next timestep only when 
all LPs have completed the current one [34]. This approach is quite
simple, but requires a barrier synchronization at each step; the over- 
all simulation speed is therefore always dominated by the slow- 
est LP. Furthermore, defining the "correct" value of the timestep 
can be difficult if not impossible for some models. 

The conservative approach prevents causality violations from
occurring. An LP must check that no messages from the past can
arrive before executing an event. This is achieved using the
Chandy-Misra-Bryant (CMB) algorithm [22], which imposes the following
constraints: (i) each LP has an incoming queue for every other LP
from which it can receive messages; (ii) each LP must generate
events in nondecreasing timestamp order; (iii) the delivery of the
events is reliable (no message can be lost) and the network does
not change the message order. Under these assumptions, each LP
checks all the incoming queues to determine the next safe
event to be processed. If there are no empty queues, then the incoming
event with the lowest timestamp is safe and can be executed.
Unfortunately, this mechanism is prone to deadlock, since an LP cannot
identify the next safe event if any incoming queue is empty. To
avoid this, the CMB algorithm introduces a new type of message
(called NULL message) with no semantic content. The receipt of
a NULL message (t, NULL) informs the receiver that the sender
has set its LVT to t, and hence will not send any event with
timestamp lower than t. NULL messages can be used to break
deadlocks, at the cost of increasing the network load. Moreover,
generation of NULL messages requires some knowledge of the simulation
model, and therefore cannot be transparent to the user.
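
The safe-event rule of the conservative approach can be illustrated with a small Erlang sketch. It is not part of ErlangTW (which uses Time Warp) and is not a full CMB implementation: the module name cmb_sketch, the map-of-lists queue representation and the atom null used for NULL messages are all assumptions made for this example.

```erlang
-module(cmb_sketch).
-export([next_safe/1, demo/0]).

%% Queues: a map from sender LP id to the FIFO list of {Timestamp, Event}
%% messages received from that sender (nondecreasing per sender).
%% Returns {safe, From, Msg, Queues1}, or blocked if some queue is empty.
next_safe(Queues) ->
    %% Collect the head timestamp of every nonempty queue.
    Heads = [{T, From} || {From, [{T, _} | _]} <- maps:to_list(Queues)],
    case Heads =/= [] andalso length(Heads) =:= maps:size(Queues) of
        false ->
            blocked;                      % at least one queue is empty
        true ->
            %% All queues are nonempty: the lowest head is safe.
            {_, From} = lists:min(Heads),
            [Msg | Rest] = maps:get(From, Queues),
            {safe, From, Msg, Queues#{From := Rest}}
    end.

demo() ->
    %% Nothing has been received from lp3 yet, so (7, e_a) is not safe.
    Q0 = #{lp1 => [{7, e_a}], lp3 => []},
    blocked = next_safe(Q0),
    %% A NULL message (9, null) from lp3 promises no event before time 9.
    Q1 = Q0#{lp3 := [{9, null}]},
    {safe, lp1, {7, e_a}, _} = next_safe(Q1),
    ok.
```

The demo shows how a NULL message unblocks an LP: once lp3 has promised not to send anything earlier than time 9, the event with timestamp 7 becomes safe.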

Finally, the Time Warp protocol [17] implements the so-called
optimistic synchronization approach. In Time Warp, each LP can
process incoming events as soon as they are received. Obviously,
causality violations may happen, and special actions must be taken
to fix them. If an LP receives a message (called straggler) with
timestamp smaller than that of some event already processed, it must roll
back the computations for these events and re-execute them in the
proper order. The problem is that some of the events to be undone
might have sent messages (events) to other LPs (e.g., (10, e3) in
Figure 3). These messages must be invalidated by sending
corresponding anti-messages. The recipient of an anti-message (t, e)
must roll back its state as well if it has already processed e, which
might trigger a cascade of rollbacks that brings the simulator back
to a previous state, discarding the incorrect computations that have
been performed.
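
A rollback triggered by a straggler can be sketched as follows. This is a simplified illustration under stated assumptions, not the ErlangTW implementation: the module name tw_rollback_sketch and the log entry layout {Time, Event, StateBefore, SentMsgs} are invented for the example, and the log is assumed to be ordered newest first.

```erlang
-module(tw_rollback_sketch).
-export([rollback/2, demo/0]).

%% rollback(StragglerTime, {CurrentState, Log}) undoes every logged
%% event whose timestamp is greater than the straggler's timestamp.
%% Returns {RestoredState, RemainingLog, ToReprocess, AntiMessages}.
rollback(Ts, {State, Log}) ->
    rollback(Ts, State, Log, [], []).

%% Undo the newest entry: restore the state saved before the event,
%% remember the event for re-execution, and emit one anti-message
%% for each message that the event had sent.
rollback(Ts, _State, [{T, Ev, Before, Sent} | Older], Redo, Anti) when T > Ts ->
    rollback(Ts, Before, Older, [{T, Ev} | Redo],
             [{anti, M} || M <- Sent] ++ Anti);
%% Nothing newer than the straggler remains in the log: done.
rollback(_Ts, State, Log, Redo, Anti) ->
    {State, Log, Redo, Anti}.

%% Loosely modeled on LP2 in Figure 3: an event at time 7, then a
%% (hypothetical) local event at time 10 that sent (10, e3) to lp1.
%% A straggler with timestamp 8 then arrives.
demo() ->
    Log = [{10, e_local, s1, [{lp1, {10, e3}}]},
           {7, e2, s0, []}],
    rollback(8, {s2, Log}).
```

In the demo only the event at time 10 is undone: the state saved before it (s1) is restored, the event is returned for re-execution after the straggler, and an anti-message is produced for (10, e3).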

In order to support rollbacks, each LP must keep a log of all
processed events and all messages sent, together with any information
needed to undo their effects. Obviously, logging each and every
event since the beginning of the simulation is infeasible, due to the
huge memory requirement. For this reason, the simulator periodically
computes the Global Virtual Time (GVT), which is a lower
bound on the timestamp of any future rollback. The GVT is simply
the smallest timestamp among unprocessed and partially processed
messages, and can be computed with a distributed snapshot
algorithm [15]. Once the GVT has been computed and sent to all LPs,
logs older than GVT can be reclaimed. GVT computation can be
a costly operation, since it usually involves some form of all-to-all
communication. Therefore, finding the optimal frequency of this
operation is a critical aspect of Time Warp; typically, the chosen
frequency is the result of a tradeoff between memory consumption
for the logs and simulation speed. However, when the underlying
execution architecture provides efficient support for reduction
operations, the GVT computation does not add too much overhead, and
the Time Warp protocol can achieve almost linear speedup even on
very large setups [26].
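
The following sketch illustrates the GVT as a minimum over per-LP reports, and the reclaiming of logs older than the GVT. It is a centralized simplification, not a distributed snapshot algorithm; the module name gvt_sketch and the report format {MinUnprocessed, MinInTransit} are assumptions made for this example. The atom infinity stands for "no such message", relying on the fact that any number compares lower than any atom in Erlang's term order.

```erlang
-module(gvt_sketch).
-export([gvt/1, fossil_collect/2, demo/0]).

%% Reports: one {MinUnprocessed, MinInTransit} pair per LP, i.e. the
%% lowest timestamp among its unprocessed events and among the
%% messages it has sent that are not yet known to be received.
gvt(Reports) ->
    lists:min([min(U, I) || {U, I} <- Reports]).

%% History: list of {Timestamp, SavedState, Event} entries.
%% Entries older than the GVT can never be needed by a rollback.
fossil_collect(Gvt, History) ->
    [E || {T, _, _} = E <- History, T >= Gvt].

demo() ->
    Gvt = gvt([{12, infinity}, {15, 9}, {20, 30}]),
    History = [{14, s3, e3}, {9, s2, e2}, {4, s1, e1}],
    {Gvt, fossil_collect(Gvt, History)}.
```

In the demo the second LP has a message in transit with timestamp 9, which bounds the GVT even though every LP has a larger local minimum; only the history entry at time 4 is reclaimed.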

Optimistic synchronization offers some advantages with respect
to conservative approaches: first, optimistic synchronization is
generally capable of exploiting a higher degree of parallelism; second,
conservative simulators require model specific information in order
to produce NULL messages, while optimistic mechanisms are less
reliant on such information (although they can exploit it if
available).

4. The ErlangTW Simulator 

Erlang is a functional, concurrent programming language based 
on lightweight threads (LWT) and message passing. This makes 
it well suited for developing parallel applications both on shared 
memory multicore machines and on a distributed memory cluster. 
An Erlang program is compiled to an intermediate representation 
called BEAM, which is executed on a Virtual Machine. If Symmet- 
ric Multiprocessing is enabled, the VM creates a separate scheduler 
for each CPU core; each scheduler fetches one ready LWT from a 
common queue and executes it. The spawn function can be used 
to create a new thread executing a given function. The VM will 
take care of dispatching threads to active schedulers. The fact that 
there is no 1:1 mapping between LWT and OS threads facilitates 
the work of the developer, since the VM takes care of balancing the 
load across the available processors. 

Each LWT has an identifier that is guaranteed to be unique 
across all VM instances, even those running on different hosts 
connected through a network. The identifier can be used by 
send/receive primitives, which are provided directly by the lan- 
guage itself and do not require external libraries. 
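
As a minimal illustration of the primitives mentioned above, the sketch below spawns a lightweight process, sends it a timestamped message using its identifier, and waits for the reply. The module name lwt_sketch and the message shapes are assumptions made for the example; spawn/1, the ! (send) operator and receive are standard Erlang.

```erlang
-module(lwt_sketch).
-export([demo/0]).

demo() ->
    Parent = self(),
    %% spawn/1 creates a new lightweight process running the given fun
    %% and returns its process identifier.
    Child = spawn(fun() ->
        receive
            {From, {T, Event}} ->
                From ! {self(), {processed, T, Event}}
        end
    end),
    %% Send a timestamped message to the child using its identifier.
    Child ! {Parent, {10, hello}},
    %% Wait for the reply coming from that specific process.
    receive
        {Child, Reply} -> Reply
    after 1000 ->
        timeout
    end.
```

The same code works unchanged when the identifier refers to a process on another connected Erlang node, which is the property ErlangTW relies on.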

The ErlangTW Simulator is an implementation of the Time
Warp algorithm described in Section 3. Although Time Warp requires
fairly sophisticated state management capabilities to support
rollbacks and antimessages, it turned out that this (fairly limited)
complexity is paid back by the fact that Time Warp does not
require ad-hoc modifications of simulation models (e.g., to compute
NULL events).

Message Format Messages exchanged between LPs are repre- 
sented using the record data type, providing the abstraction of a 
key-value tuple. Messages have the following structure: 

-record(message, {type,
                  seqNumber,
                  lpSender,
                  lpReceiver,
                  payload,
                  timestamp}).

The type field represents the message type; current types are:
event (normal event), ack (acknowledgement to ensure reliable
delivery of messages), marked_ack (special kind of acknowledgement
required by Samadi's algorithm, described later), and antimessage
(used during rollbacks). seqNumber is a numeric value representing
how many messages the sender LP has sent. lpSender
and lpReceiver are the unique identifiers of the sender and
receiver LPs. payload is the actual content of the message, describing
the event to process and all ancillary data. Finally, timestamp
is the simulated time associated to the event contained in the payload.

The simulator needs to acknowledge messages in order to guarantee
the correctness of its global state, because each message in
the system must be taken into account by one LP only. The Erlang
VM guarantees message delivery, but only from an LWT to
another one's mailbox; this could lead to a situation in
which an LP has received a particular message but has not yet
read it, so it is unaware of its presence. Conversely, once an LP
receives an acknowledgement for a message, it knows that the message
has already been taken into account by the receiver. An example of
global state is the Global Virtual Time, explained in the following.

Here is an example of a message:

#message{type=event,    % one of: event, ack, marked_ack, antimessage
         seqNumber=100,
         lpSender=<100, , 0>,
         lpReceiver=<100, 1, 0>,
         payload="hello",
         timestamp=10}

Event Queue Each LP maintains a priority queue of incoming
messages sorted in nondecreasing timestamp order. The LP fetches
the message with the lowest timestamp from the queue and, if the
message is not a straggler, immediately executes the associated
event. The queue is implemented as an Andersson General Balanced
Tree [2]. The tree contains (Key, Value) pairs, where the Key
is the simulation time, and the Value is a list of events which are
scheduled to happen at that time (ErlangTW supports simultaneous
events, i.e., multiple events happening at the same simulated time).
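
An event queue of this kind can be sketched with the gb_trees module of the Erlang standard library, which implements General Balanced Trees. The code below is an illustration of the data structure described above and not the actual ErlangTW source; the module name evq_sketch and its function names are assumptions.

```erlang
-module(evq_sketch).
-export([new/0, add/3, take_next/1, demo/0]).

new() -> gb_trees:empty().

%% Key: simulation time. Value: list of events scheduled at that time,
%% kept in insertion order.
add(Time, Event, Q) ->
    case gb_trees:lookup(Time, Q) of
        none            -> gb_trees:insert(Time, [Event], Q);
        {value, Events} -> gb_trees:update(Time, Events ++ [Event], Q)
    end.

%% Remove and return one event with the lowest timestamp.
take_next(Q) ->
    case gb_trees:is_empty(Q) of
        true ->
            empty;
        false ->
            case gb_trees:take_smallest(Q) of
                {Time, [Event], Q1} ->
                    {Time, Event, Q1};
                {Time, [Event | Rest], Q1} ->
                    %% Other simultaneous events stay in the queue.
                    {Time, Event, gb_trees:insert(Time, Rest, Q1)}
            end
    end.

demo() ->
    Q0 = add(10, c, add(5, b, add(10, a, new()))),
    {T1, E1, Q1} = take_next(Q0),
    {T2, E2, Q2} = take_next(Q1),
    {T3, E3, Q3} = take_next(Q2),
    {[{T1, E1}, {T2, E2}, {T3, E3}], take_next(Q3)}.
```

In the demo, events a and c share timestamp 10 and are returned after b (timestamp 5), in the order in which they were inserted.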

Logical Processes Each LP is implemented as an Erlang LWT 
created using the spawn function. LPs communicate using the send 
and receive operators. The state of an LP is kept in a record with 
the following structure: 

-record(lp_status, {my_id,
                    received_messages,
                    inbox_messages,
                    max_received_messages,
                    proc_messages,
                    to_ack_messages,
                    anti_messages,
                    current_event,
                    history, gvt,
                    rollbacks,
                    timestamp,
                    model_state,
                    init_model_state,
                    samadi_find_mode,
                    samadi_marked_messages_min,
                    messageSeqNumber,
                    status}).

where:

my_id is the unique identifier of the LP;

received_messages is the list of unprocessed messages, read from
the process mailbox;

inbox_messages is the incoming message queue containing
unprocessed messages;

proc_messages is a data structure which contains, for each
processed event, the list of messages sent by that event to remote
entities. This data structure is required to perform rollbacks
when necessary, because it contains the events to reprocess and
the antimessages to send;

to_ack_messages is a list of events, sorted in nondecreasing
timestamp order, related to the messages sent by the LP still to be
acknowledged;

model_state is the user-defined structure containing the state of the
simulation model;

timestamp is the LVT;

history is the list of processed events, used by the Time Warp
protocol to perform rollbacks when necessary. Each element of
the list is a tuple of the form {Timestamp, model_state, Event},
and records the state of this LP at the given simulation time,
before the Event has been processed. A tuple is added to the
history after an event has been extracted from inbox_messages
and executed;

samadi_* are the data structures needed to implement Samadi's
GVT algorithm, described later in this section.

Implementing Simulated Entities As already described in Section 3,
an LP is a container of simulation entities. Each entity is
the representation of some actor or component of the "real" system.
By decoupling LPs from entities, the simulation modelers can
avoid dealing with partitioning; however, if more control over the
simulator is desired, the modelers can implement their own custom
partitioning by working at the LP level.

In ErlangTW there is a layer between LPs and entities, in order to
implement the separation of concerns described above. The modeler
implements three functions in a particular Erlang module called
user; these functions define the actions executed by each LP during
initialization, event processing, and termination. The PHOLD
model (described in Section 5) uses an initialization function to
evenly partition the entities between the running LPs. The event
processing function implements the behavior executed by each entity
upon receipt of a new message. Finally, the termination function
is normally used to display or save simulation results or other
information at the end of each simulation run. Each message
contains a field called payload that can transport any kind of
user-defined data. As a specific example, the event data structure used by
the PHOLD model to manage entities has the following structure:

-record(payload,
        {entitySender, entityReceiver, value}).

and can be instantiated, for example, as follows:

#payload{entitySender=10,
         entityReceiver=122,
         value=42}

In this example entity 10 has sent a message to entity 122
with a payload containing the integer 42. In the current
implementation of ErlangTW, where the allocation of entities to LPs
must be manually defined, the user specifies a mapping function
which is used by ErlangTW to deliver each message to the
appropriate LP. In future versions we plan to implement an automatic
allocation mechanism and to provide this binding transparently.
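As an illustration of such a mapping function, the following minimal sketch assumes that entities are numbered from 1 to E and are assigned to LPs in contiguous blocks of E/L entities. The module name entity_map and the functions lp_of/2 and route/2 are hypothetical names chosen for this example; they are not part of the ErlangTW API.

```erlang
-module(entity_map).
-export([lp_of/2, route/2]).

%% Same record layout as the PHOLD payload shown above.
-record(payload, {entitySender, entityReceiver, value}).

%% Hypothetical block mapping: entities 1..EntitiesPerLP live on LP 1,
%% the next EntitiesPerLP entities on LP 2, and so on.
lp_of(Entity, EntitiesPerLP)
  when is_integer(Entity), Entity >= 1,
       is_integer(EntitiesPerLP), EntitiesPerLP >= 1 ->
    (Entity - 1) div EntitiesPerLP + 1.

%% Returns the identifier of the LP that must receive the message.
route(#payload{entityReceiver = Receiver}, EntitiesPerLP) ->
    lp_of(Receiver, EntitiesPerLP).
```

With E = 840 and L = 8 (105 entities per LP), the message of the example above, addressed to entity 122, would be routed to LP 2 under this block mapping.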



Global Virtual Time The Global Virtual Time is calculated with
Samadi's algorithm [30]. One LWT, called the GVT Controller, is
responsible for periodically checking the smallest timestamp of all
events stored in the queues of all LPs; the GVT controller is also
responsible for starting and stopping the simulation. In the current
version of ErlangTW, the GVT controller periodically broadcasts
a GVT computation request message to all LPs; each LP sends
back the value of its LVT so that the controller can compute
the GVT as the minimum of these values. The GVT is finally sent
to all LPs, which can then prune their local history by removing all
checkpoints older than the GVT.
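The basic computation described above can be sketched as follows. This is a minimal illustration and not the ErlangTW code: the module name gvt_sketch and its functions are hypothetical, and the history is assumed to be a list of {Timestamp, ModelState, Event} tuples as described earlier.

```erlang
-module(gvt_sketch).
-export([gvt/1, prune/2]).

%% GVT as the minimum of the LVT values reported by the LPs.
gvt(LVTs) when LVTs =/= [] ->
    lists:min(LVTs).

%% History entries are {Timestamp, ModelState, Event} tuples.
%% Entries strictly older than the GVT are dropped; the others are kept.
prune(History, GVT) ->
    [Entry || {Timestamp, _State, _Event} = Entry <- History,
              Timestamp >= GVT].
```

For example, with reported LVTs 20.0, 12.5 and 31.0 the GVT is 12.5, and an LP would discard every checkpoint whose timestamp is smaller than 12.5.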

In practice, the calculation of the GVT is complex given that
some messages could be in flight when the sender and/or the
receiver LPs are reporting their LVT. Ignoring these messages would
result in a wrong (overestimated) GVT and hang the whole
simulation. The solution proposed by Samadi is to add an
acknowledgment for each message used for the GVT calculation, to
properly identify in-flight messages and to decide which LP must
take them into account.
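The effect of in-flight messages can be illustrated with a simplified sketch in which each LP reports, together with its LVT, the timestamps of the messages it has sent but for which no acknowledgment has arrived yet. This only conveys the idea and is not a full implementation of Samadi's algorithm; the module gvt_inflight and its functions are hypothetical.

```erlang
-module(gvt_inflight).
-export([local_min/2, gvt/1]).

%% Value reported by one LP: the minimum between its LVT and the
%% timestamps of the messages it has sent but not yet seen acknowledged.
local_min(LVT, UnackedTimestamps) ->
    lists:min([LVT | UnackedTimestamps]).

%% Reports is a list of {LVT, UnackedTimestamps} pairs, one per LP.
gvt(Reports) when Reports =/= [] ->
    lists:min([local_min(LVT, Unacked) || {LVT, Unacked} <- Reports]).
```

Consider two LPs with LVTs 40 and 35, where the first LP has an unacknowledged message with timestamp 22. Taking only the LVTs would give a GVT of 35, whereas accounting for the in-flight message gives 22, so the checkpoints needed to process that message are not discarded.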

In future versions of ErlangTW we plan to compute the GVT 
using a more scalable reduction operation. 

Random Number Generation The pseudo random number generator
used by each simulated entity is the Linear Congruential
Generator described by Park and Miller in [23]. The initial seed
can be stored in a configuration file which is read by ErlangTW
before starting the simulation run. All entities within the same LP
share a common random number generator, which is initialized with
the seed in the configuration file. In this way it is possible
to start the simulator in a known state, to achieve determinism and
repeatability.
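A generator of this family can be sketched in a few lines. The sketch below assumes the "minimal standard" constants proposed by Park and Miller (multiplier 16807, modulus 2^31 - 1); the constants actually used by ErlangTW are not stated in the text, and the module name lcg is hypothetical.

```erlang
-module(lcg).
-export([next/1, uniform/1, nth/2]).

-define(A, 16807).          % multiplier (7^5), assumed
-define(M, 2147483647).     % modulus (2^31 - 1), assumed

%% One step of the generator; Seed must be in 1..M-1.
next(Seed) when is_integer(Seed), Seed > 0, Seed < ?M ->
    (?A * Seed) rem ?M.

%% Returns {U, NewSeed}, where U is a float strictly between 0 and 1.
uniform(Seed) ->
    New = next(Seed),
    {New / ?M, New}.

%% State of the generator after N steps starting from Seed.
nth(Seed, 0) -> Seed;
nth(Seed, N) when is_integer(N), N > 0 -> nth(next(Seed), N - 1).
```

Since the state is passed explicitly, two runs started from the same seed produce the same sequence, which is the repeatability property mentioned above. Park and Miller suggest checking an implementation by starting from seed 1 and verifying that the value obtained after 10000 steps is 1043618065.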

5. Performance Evaluation 

In this section we evaluate the scalability of ErlangTW, both on
shared memory and distributed memory architectures, using a
synthetic benchmark called PHOLD fl4fl, which is specifically
designed for the performance evaluation of Time Warp
implementations.

The PHOLD Benchmark PHOLD is the parallel version of the
HOLD benchmark for event queues iTHl and it is quite simple to
implement and describe. The model is made of a set of E entities
that are partitioned among L LPs; each LP contains the same
number E/L of entities. Each entity produces and consumes events.
When an entity consumes an event, a new event is generated and
delivered to another entity (note that the total number of events in
the system remains constant). The timestamp of the new event is
computed by adding an exponentially distributed random number with
mean 5.0 to the timestamp of the received event. In this model the
recipient is randomly chosen using a uniform distribution.
Therefore, each event has a probability $1/L$ of being sent to an
entity in the same LP as the originator, and a probability
$(L - 1)/L$ of being sent to an entity on a different LP. As the
number L of LPs increases, the ratio of remote vs local events
increases. The PHOLD benchmark is homogeneous in terms of load
assigned to the LPs: all of them have the same amount of
communication and computation. While this can be unrealistic for
general simulation models, it is important to remark that the Time
Warp mechanism (in its original version) does require a good level
of balancing to obtain good performance results IT7t flQfl . Hence,
the goal of PHOLD is to study the scalability of Time Warp
implementations by considering an appropriate execution
environment.
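The event generation rule of PHOLD can be sketched as follows. For brevity the sketch uses Erlang's standard rand module instead of the Park and Miller generator, and samples the exponential increment by inverse transform; the module phold_sketch and its function names are hypothetical.

```erlang
-module(phold_sketch).
-export([next_event/2, exp_increment/1, remote_probability/1]).

%% Exponentially distributed increment with the given mean.
%% rand:uniform/0 returns a float smaller than 1.0, so the argument
%% of the logarithm is always positive.
exp_increment(Mean) when Mean > 0 ->
    -Mean * math:log(1.0 - rand:uniform()).

%% Builds the event generated when an entity consumes an event with
%% timestamp Now; the recipient is drawn uniformly from 1..E.
next_event(Now, E) when is_integer(E), E >= 1 ->
    Recipient = rand:uniform(E),
    {Now + exp_increment(5.0), Recipient}.

%% Probability that the recipient lives on a different LP when the
%% entities are evenly spread over L LPs.
remote_probability(L) when is_integer(L), L >= 1 ->
    (L - 1) / L.
```

With L = 4 LPs three events out of four are remote on average, while with L = 8 the fraction grows to 7/8, which is why the communication cost becomes more relevant as L increases.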

There are four main parameters which are used to control the
benchmark:

• The number L of LPs

• The number E of entities

• The event density p, 0 < p < 1, defined as the fraction of
entities that generate an event at the beginning of the simulation.
At each simulation time there are pE events in the system

• The workload, used to tune the computation / communication
ratio by running some CPU-intensive computation each time an
event is processed. In our case, we implemented the workload
as a pre-defined number of floating point operations (FPops)

Parameter                 Values
Number of LPs (L)         1, ..., 8 (shared memory)
                          1, 2, 3, 6 (distributed memory)
Number of entities (E)    840, 1680, 2520, 3360
Event Density (p)         0.5
Workload                  1000, 5500, 10000 FPops

Table 1. Parameters used in the simulations
Experimental Setup Table 1 shows the parameters which have
been used in the simulation runs. We tested ErlangTW both on a
shared memory and on a distributed memory architecture.

The number of entities E has been chosen as a multiple of 840,
which is the least common multiple of the numbers of LPs we
considered (i.e., 840 is the smallest integer multiple of all
integers in the range 1, ..., 8). This ensures that the number of
entities allocated to each LP, E/L, is an integer.
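The choice of 840 can be verified with a few lines of Erlang; the module name lcm_check is hypothetical.

```erlang
-module(lcm_check).
-export([lcm_upto/1]).

%% Greatest common divisor (Euclid's algorithm).
gcd(A, 0) -> A;
gcd(A, B) -> gcd(B, A rem B).

%% Least common multiple of two positive integers.
lcm(A, B) -> A * B div gcd(A, B).

%% Least common multiple of all integers in 1..N.
lcm_upto(N) when is_integer(N), N >= 1 ->
    lists:foldl(fun lcm/2, 1, lists:seq(1, N)).
```

Calling lcm_check:lcm_upto(8) returns 840, so each of the values 840, 1680, 2520 and 3360 can be split evenly among any number of LPs between 1 and 8.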

As already described, the event density has been set to 0.5,
which means that, at a given time, the average number of events in
the system is 0.5 × E. We considered three different workloads of
1000, 5500 and 10000 Floating Point Operations. Finally, the GVT
is computed every 5 seconds.

We measured the wall clock time of a simulation run until
the GVT reaches 1000. In order to produce statistically valid
results, we performed 30 runs for each experiment, and computed the
average of each batch. We investigated the scalability of ErlangTW
by computing the speedup as a function of the number L of LPs.

ErlangTW on Shared Memory The shared memory system
(gda i7) is equipped with an Intel(R) Core(TM) i7-2600 CPU at
3.40GHz with 4 physical cores and Hyper-Threading (HT)
technology [21]. The system has 8 GB of RAM and runs Ubuntu 12.04
(x86_64 GNU/Linux, 3.2.0-24-generic #39-Ubuntu SMP). For this
system we considered several values for L, namely L = 1, ..., 8 LPs.
HT works by duplicating some parts of the processor except the main
execution units. From the point of view of the Operating System,
each physical processor core corresponds to two logical processors.
The impact of virtual cores on PADS is worth investigating [5, 21]
and will be reported in the following.

Figure 4 shows the speedup $S_L$ as a function of L; recall that
$S_L = T_1 / T_L$, where $T_n$ is the wall clock simulation time
when n LPs are used. In each figure we consider a specific value for
the workload, and we plot a curve for each number of entities
E. As a general trend we observe that scalability improves as the
number of entities gets large; also, scalability improves marginally
if the workload (FPops) increases. Figure 5 shows the efficiency
$\mathrm{Eff}_L = S_L / L$ as a function of L. The efficiency is an
estimate of the fraction of actual computation performed by all
processors, as opposed to communication and synchronization.
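The two metrics can be computed directly from the measured wall clock times, as in the following sketch. The module scaling is hypothetical, and the numbers used in the note below are illustrative values, not measurements from the experiments.

```erlang
-module(scaling).
-export([mean/1, speedup/2, efficiency/2]).

%% Arithmetic mean of the wall clock times of a batch of runs.
mean(Times) when Times =/= [] ->
    lists:sum(Times) / length(Times).

%% S_L = T_1 / T_L
speedup(T1, TL) when TL > 0 ->
    T1 / TL.

%% Eff_L = S_L / L
efficiency(SL, L) when is_integer(L), L >= 1 ->
    SL / L.
```

For instance, if a run took 100 seconds with one LP and 50 seconds with four LPs, the speedup would be 2.0 and the efficiency 0.5, meaning that on average only half of the available processing capacity was spent on useful computation.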

ErlangTW exhibits good scalability and efficiency up to L = 4,
since in this case each LP can be executed on a separate physical
processor core. The transition from L = 4 to L = 5 shows a
noticeable drop in the speedup (and therefore in the efficiency),
which is easily explained by the effect of HT. When L = 5,
one of the physical CPU cores executes two LPs and becomes
the bottleneck. The Time Warp protocol works well when the
workload is well balanced, but degrades significantly if hot spots
are present 0].

Host       CPU                    Physical Cores  HT   RAM  Operating System               Network
gda i7     Intel i7-2600 3.40GHz  4               Yes  8GB  GNU/Linux Kernel 3.2 (x86_64)  Not used

cassandra  Intel Xeon 2.80GHz     2               Yes  3GB  GNU/Linux Kernel 2.6 (x86_32)  Gigabit Ethernet
cerbero    Intel Xeon 2.80GHz     2               Yes  2GB  GNU/Linux Kernel 2.6 (x86_32)  Gigabit Ethernet
chernobog  Intel Xeon 2.40GHz     4               No   4GB  GNU/Linux Kernel 2.6 (x86_64)  Gigabit Ethernet

Table 2. Experimental testbeds (top: shared memory; bottom: distributed memory)

Figure 4. Speedup on the shared memory architecture as a function of the number of LPs (higher is better)

Figure 5. Efficiency on the shared memory architecture as a function of the number of LPs (higher is better)

Figure 6. Total number of rollbacks on the shared memory architecture as a function of the number of LPs (lower is better)



To better understand this, we report in Figure 6 the mean total
number of rollbacks which occurred during the whole simulation
run. A large number of rollbacks indicates that the LVTs of the
individual LPs are advancing at different rates. The PHOLD model
is balanced by construction, since all entities perform identical
tasks and are uniformly distributed across the LPs. From Figure 6
we see that the number of rollbacks increases in the region L =
1, ..., 4; if L = 1 no rollbacks happen, since all events are
managed through the event queue of a single LP, so that causality
is always ensured. Adding more LPs increases the possibility of
receiving a straggler. From L = 4 to L = 5 load imbalance occurs
and the number of rollbacks sharply increases. The LPs running
on the overloaded processor core lag behind the other LPs, and
a large number of antimessages is produced to undo the updates
performed by the faster LPs. As the number of LPs increases further,
we observe that the number of rollbacks decreases, since the system
becomes more and more balanced.
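The mechanism behind these rollbacks can be sketched as follows. The sketch assumes that the history of an LP is a list of {Timestamp, ModelState, Event} tuples sorted by increasing timestamp, where each ModelState is the state before Event was processed; the ordering, the module rollback_sketch and its functions are assumptions of this example, and the generation of antimessages is not modeled.

```erlang
-module(rollback_sketch).
-export([is_straggler/2, rollback/2]).

%% A straggler carries a timestamp smaller than the LVT of the receiver.
is_straggler(Timestamp, LVT) ->
    Timestamp < LVT.

%% Undoes every processed event whose timestamp is not smaller than the
%% timestamp of the straggler.
%% Returns {RestoredState, KeptHistory, EventsToReprocess}, or the atom
%% nothing_to_undo if no processed event has to be undone.
rollback(StragglerTs, History) ->
    {Kept, Undone} =
        lists:splitwith(fun({Ts, _, _}) -> Ts < StragglerTs end, History),
    case Undone of
        [] ->
            nothing_to_undo;
        [{_, State, _} | _] ->
            {State, Kept, [Event || {_, _, Event} <- Undone]}
    end.
```

If an LP has processed events with timestamps 1.0, 4.0 and 9.0 and then receives a message with timestamp 3.0, the state saved before the event at time 4.0 is restored and the events at times 4.0 and 9.0 must be processed again after the straggler.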

In practice it is extremely difficult, if not impossible, to
statically partition a PADS model such that the workload is balanced
across the LPs, since the computation / communication ratio can
change during the simulation. If detailed knowledge of the
simulation model is not available in advance, as is the case most of
the time, it is necessary to resort to adaptive entity migration
techniques to balance the LPs [11]. It is worth mentioning that
Erlang offers native support for code migration, which greatly
simplifies the implementation of such techniques; this will be the
focus of future extensions of this work.

ErlangTW on Distributed Memory The distributed memory system
is the research cluster of the PADS group at the University
of Bologna. We used three machines, cassandra, cerbero and
chernobog, whose configuration is shown in Table 2. We
performed experiments with L = 1, 2, 3, 6. For L = 1, the LP was
executed on cassandra; for L = 2, one LP was executed on cassandra
and the other one on cerbero. For L = 3 we ran a single LP
on each of the three machines. Finally, when L = 6 we executed
two LPs on each of the three machines.
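The allocation of LPs to machines described above corresponds to a round-robin assignment over the list of hosts, which can be sketched as follows. The module placement and the function place/2 are hypothetical; how ErlangTW actually starts LPs on remote nodes is not shown here.

```erlang
-module(placement).
-export([place/2]).

%% Assigns LPs 1..L to the hosts in round-robin order.
%% Returns a list of {LpId, Host} pairs.
place(L, Hosts) when is_integer(L), L >= 1, Hosts =/= [] ->
    N = length(Hosts),
    [{Lp, lists:nth((Lp - 1) rem N + 1, Hosts)} || Lp <- lists:seq(1, L)].
```

With the host list [cassandra, cerbero, chernobog], place/2 puts the only LP on cassandra when L = 1, uses cassandra and cerbero when L = 2, and assigns two LPs to each machine when L = 6.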

Figure 7 shows the speedup of the PHOLD model, measured
on our distributed memory cluster. Thanks to the Erlang language,
it was possible to execute the exact same implementation which
was tested on the shared memory machine. Again, each value is
obtained by averaging 30 simulation runs. The most prominent
feature of these figures is the superlinear speedup which occurs
with L = 2 and L = 3 LPs. As in most of these situations,
this superlinear speedup can be explained by the fact that the
machine used for the test with L = 1 (cassandra) has limited
memory, and therefore makes use of virtual memory during the
simulation. To confirm this hypothesis, we reduced the amount of
memory required by the PHOLD model by reducing the wall clock
time between GVT calculations. Recall from the Global Virtual Time
paragraph that, once the GVT is known, each LP can discard logs for
events executed before the GVT, since these events will never be
rolled back. Therefore, increasing the frequency of GVT calculation
results in a reduced memory footprint of the simulation model, at
the cost of a higher number of communications. The tests shown in
Figure 7 were done with the GVT computed every 5s of wall clock
time; reducing this interval to 1s produces the more reasonable
results shown in Figure 8.

Scalability on the distributed memory cluster is quite poor, as
confirmed by the efficiency shown in Figure 9. This result can be
explained by observing that PADS applications often exhibit a low
computation / communication ratio, and in our distributed memory
testbed the communication network uses the standard Gigabit
Ethernet protocol, which suffers from non-negligible latency. Note
from Figure 9 that scalability and efficiency are particularly poor
for low workload intensities (1000 and 5500 FPops) and for low
numbers of entities. In these situations PHOLD is communication
bound, and the latency introduced by the commodity LAN severely
impacts the overall performance.



Since our cluster includes heterogeneous machines, the load
is not evenly balanced across the LPs, and this generates a large
number of antimessages. In Figure 10 we plot the mean total number
of rollbacks as a function of the number of LPs L. The number
of rollbacks sharply increases from L = 2 to L = 3, and this can
be explained by the fact that cassandra and cerbero have a
similar hardware configuration, while chernobog (which is used
when L = 3 and L = 6) is much more powerful. As in the shared
memory case, the faster LPs run ahead of the others and receive a
large number of stragglers, which generate a cascade of rollbacks.

6. Conclusion and future work 

In this paper we described ErlangTW, an implementation of the
Time Warp protocol for parallel and distributed simulations written
in Erlang. ErlangTW allows the same simulation model to
be executed (unmodified) on single-core, multicore and
distributed computing architectures. We described the prototype
implementation of ErlangTW, and analyzed its scalability on a
multicore, shared memory machine and on a small distributed memory
cluster using the PHOLD benchmark.

Results show that Erlang provides a good framework to build
simulators, thanks to its powerful language features and virtual
machine facilities; furthermore, Erlang's transparent message
brokering system greatly simplifies the development of complex
distributed applications, such as PADS. Performance results with the
PHOLD benchmark show that scalability and efficiency on shared
memory architectures are very good, while distributed memory
architectures are less friendly, performance-wise, to these kinds of
applications.

As seen before, the communication overhead of the distributed
execution environment has a big impact on the performance of the
simulator, and the Time Warp synchronization algorithm reacts badly
to imbalances in the execution architecture (e.g. CPUs with very
different speeds or presence of background load). Both these
problems can be addressed using features provided by Erlang: the
serialization of objects and data structures, and code migration.
Thanks to these, it is possible to implement the transfer of
simulated entities across different LPs, or even to move a whole LP
to a different CPU, all at runtime. In this way, the ErlangTW
simulator would be able to reduce the communication cost by
adaptively clustering highly interacting entities within the same
LP. Furthermore, it will be possible to implement other advanced
forms of load-balancing [11] to speed up the execution and to reduce
the number of rollbacks. This will permit the implementation of new
adaptive simulators that can change their configuration at runtime.
To further enhance the performance of ErlangTW, we will exploit
additional parallelization of the LP, by decoupling message
dispatching from entity management using separate LWTs.
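The serialization step on which entity migration could rely can be made explicit with Erlang's built-in term_to_binary/1 and binary_to_term/1. This is only a sketch of one possible building block, not a description of a planned ErlangTW interface; the module migrate_sketch and the shape of the entity state are hypothetical.

```erlang
-module(migrate_sketch).
-export([pack/1, unpack/1]).

%% Serializes the state of an entity into a binary that can be stored
%% or sent to another LP, possibly running on another node.
pack(EntityState) ->
    term_to_binary(EntityState).

%% Rebuilds the entity state on the receiving side.
unpack(Binary) when is_binary(Binary) ->
    binary_to_term(Binary).
```

Erlang already serializes terms transparently when a message crosses node boundaries, so explicit packing is shown here only to make the step visible; in practice the state of an entity could simply be sent as a message to the destination LP.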

Source Code Availability 

The ErlangTW Simulator is released under the GNU General Public
License (GPL) version 2 and can be freely downloaded from
http://pads.cs.unibo.it/